Using Natural Language Processing and the Gene Ontology to Populate a Structured Pathway Database
نویسندگان
چکیده
Reading literature is one of the most time consuming tasks a busy scientist has to contend with. As the volume of literature continues to grow there is a need to sort through this information in a more efficient manner. Mapping the pathways of genes and proteins of interest is one goal that requires frequent reference to the literature. Pathway databases can help here and scientists currently have a choice between buying access to externally curated pathway databases or building their own in house. However such databases are either expensive to license or slow to populate manually. Building upon easily available, open-source tools we have developed a pipeline to automate the collection, structuring and storage of gene and protein interaction data from the literature. As a team of both biologists and computer scientists we integrated our natural language processing (NLP) software with the Gene Ontology (GO) to collect and translate unstructured text data into structured interaction data. For NLP we used a machine learning approach with a rule induction program, RAPIER (http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was modified to learn rules from tagged documents, and then it was trained on a corpus tagged by expert curators. The resulting rules were used to extract information from a test corpus automatically. Extracted Genes and Proteins were mapped onto Locuslink, and extracted interactions were mapped onto GO. Once information was structured in this way it was stored in a pathway database and this formal structure allowed us to perform advanced data mining and visualization..
منابع مشابه
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
متن کاملAutomatic Ontology Population extracted from SAM Healthcare Texts in Portuguese
We describe a proposal of the steps needed to automatically extract the information about healthcare providing activities from an actual EHR at use in a Portuguese Region (Portalegre) to populate an Ontology. We present the steps to manually and further automatically populate using a suggested Software Architecture and the appropriate Natural Language Processing techniques for Portuguese Clinic...
متن کاملMapping of TP53 protein network using cytoscape software
TP53 acts as a tumor suppressor in cancer. It induces cell cycle arrest or apoptosis in response to cellular stress and damage. p53 gene alteration could cause uncontrolled cell proliferation.In the present study, we used TP53 gene as the seed in the construction of a protein-protein functional association network to identify genes that might involve in tumorgenesis process with TP53. TP53 prot...
متن کاملUsing the Protein-protein Interaction Network to Identifying the Biomarkers in Evolution of the Oocyte
Background Oocyte maturity includes nuclear and cytoplasmic maturity, both of which are important for embryo fertilization. The development of oocyte is not limited to the period of follicular growth, and starts from the embryonic period and continues throughout life. In this study, for the purpose of evaluating the effect of the FSH hormone on the expression of genes, GEO access codes for this...
متن کاملطراحی سامانه هوشمند ساخت هستان نگار به کمک شبکه عصبی ARTو روشC-value
In recent years, many efforts have been done to design ontology learning methods and automate ontology construction process. The ontology construction process is a time-consuming and costly procedure for almost all domains/applications, so automating this process is a solution to overcome the knowledge acquisition bottleneck in information systems and reduce the construction cost. In this artic...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003